This project has as its final goal the creation of a statistical model capable of predicting newborn weight on the basis of clinical data collected on 2500 newborns from three hospitals.

First, a set of libraries that will be used throughout the analysis is imported and the newborns.csv dataset is loaded.

library(dplyr)
library(kableExtra)
library(moments) 
library(ggplot2)
library(GGally)
library(ggcorrplot)
library(patchwork)
library(car)
library(MASS)
library(Metrics)
library(lmtest)
library(plotly)
df = read.csv("newborns.csv", stringsAsFactors = TRUE)
attach(df)
kbl(head(df)) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Mother.age Pregnancies.n Smoker Gestation.w Weight Length Cranium Birth.type Hospital Sex
26 0 0 42 3380 490 325 Nat osp3 M
21 2 0 39 3150 490 345 Nat osp1 F
34 3 0 38 3640 500 375 Nat osp2 M
28 1 0 41 3690 515 365 Nat osp2 M
20 0 0 38 3700 480 335 Nat osp3 F
32 0 0 40 3200 495 340 Nat osp2 F

1. Preliminary Analysis

In the first phase, a descriptive analysis is carried out in order to explore the variables, understand their distribution and identify any anomalies.

kbl(summary(df)) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Mother.age Pregnancies.n Smoker Gestation.w Weight Length Cranium Birth.type Hospital Sex
Min. : 0.00 Min. : 0.0000 Min. :0.0000 Min. :25.00 Min. : 830 Min. :310.0 Min. :235 Ces: 728 osp1:816 F:1256
1st Qu.:25.00 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:38.00 1st Qu.:2990 1st Qu.:480.0 1st Qu.:330 Nat:1772 osp2:849 M:1244
Median :28.00 Median : 1.0000 Median :0.0000 Median :39.00 Median :3300 Median :500.0 Median :340 NA osp3:835 NA
Mean :28.16 Mean : 0.9812 Mean :0.0416 Mean :38.98 Mean :3284 Mean :494.7 Mean :340 NA NA NA
3rd Qu.:32.00 3rd Qu.: 1.0000 3rd Qu.:0.0000 3rd Qu.:40.00 3rd Qu.:3620 3rd Qu.:510.0 3rd Qu.:350 NA NA NA
Max. :46.00 Max. :12.0000 Max. :1.0000 Max. :43.00 Max. :4930 Max. :565.0 Max. :390 NA NA NA

The normality of the Weight variable is checked, as it will be considered as the response variable in the regression model.

# Calculation of skewness and kurtosis
paste("Skewness:", round(skewness(Weight), 3))
## [1] "Skewness: -0.647"
paste("Kurtosis:", round(kurtosis(Weight), 3))
## [1] "Kurtosis: 5.032"

The Weight variable has a negative skewness index, so the distribution is slightly left-skewed (with a longer left tail), and a positive kurtosis index, i.e. it has a leptokurtic distribution, as can also be seen from the density plot.

ggplot(df) +
  geom_density(aes(x = Weight), col = "black", fill = "steelblue") +
  labs(title = "Density of newborn weight",
       x = "Weight (g)", y = "Density")

# Shapiro–Wilk test to check normality
shapiro.test(Weight)
## 
##  Shapiro-Wilk normality test
## 
## data:  Weight
## W = 0.97066, p-value < 2.2e-16

Running the Shapiro–Wilk test, the null hypothesis of normality for the Weight variable is rejected. In this case, since the distribution of the response variable is not normal, the residuals of the regression model may also not have a normal distribution (which is one of the assumptions of linear regression), but this will be checked after the model has been created.

Next, a correlation matrix is created to assess the relationships between the continuous numeric variables, and it is visualised graphically using the ggcorrplot package.

# Correlation matrix (only for continuous numeric variables)
cor_matrix <- cor(df %>% dplyr::select(Weight, Length, Cranium, Mother.age, Gestation.w))
# Visualisation of the correlation matrix with ggcorrplot
ggcorrplot(cor_matrix, hc.order = TRUE, lab = TRUE)

Among the explanatory variables there do not appear to be strong correlations that would cause multicollinearity issues for the regression model, but this aspect will be further checked later using the Variance Inflation Factor (VIF).

The relationships between the continuous variables are also visualised by means of a scatter plot matrix (pair plot), created with the ggpairs function from the GGally package.

# Pair plot for the continuous numeric variables

# Function to modify the parameters of the scatter plots
lower_func <- function(data, mapping, ...) {
  ggplot(data, mapping) + 
    geom_point(size = 0.7, alpha = 0.5, position = "jitter") +
    geom_smooth(method = "lm", color = "red")
}

pp = ggpairs(df, 
             columns = c("Weight", "Length", "Cranium", "Mother.age", "Gestation.w"),
             lower = list(continuous = lower_func))
pp[5,1] = pp[5,1] + scale_x_continuous(breaks = c(1500, 3000, 4500))
pp[5,2] = pp[5,2] + scale_x_continuous(breaks = c(350, 450, 550))
pp

The impact of the qualitative variables Sex, Smoker (mother smoking status) and Birth.type on the Weight variable is then explored using boxplots.

# Boxplots to explore Weight with respect to qualitative variables
bp1 = ggplot(df, aes(x = "", y = Weight)) + 
  geom_boxplot(fill = "steelblue") +
  ggtitle("Distribution of Weight (all newborns)")

bp2 = ggplot(df, aes(x = Sex, y = Weight, fill = Sex)) + 
  geom_boxplot() +
  ggtitle("Distribution of Weight by Sex")

bp3 = ggplot(df, aes(x = as.factor(Smoker), y = Weight, fill = as.factor(Smoker))) + 
  geom_boxplot() +
  ggtitle("Distribution of Weight by smoking status") + 
  scale_x_discrete(name = "Smoker", labels = c("0" = "No", "1" = "Yes")) +
  scale_fill_discrete(name = "Smoker", labels = c("No", "Yes"))

bp4 = ggplot(df, aes(x = Birth.type, y = Weight, fill = Birth.type)) + 
  geom_boxplot() +
  ggtitle("Distribution of Weight by birth type")

bp1 + bp2 + bp3 + bp4

2. Hypothesis Testing

In this section the following hypotheses are now tested using the appropriate statistical tests:

  1. In some hospitals more caesarean sections are performed
  2. The means of Weight and Length in this sample of newborns are significantly equal to those of the population
  3. The anthropometric measurements (Weight, Length, Cranium) are significantly different between the two sexes

For the first point, it is checked whether the distribution of cesarean sections varies significantly between hospitals. A contingency table is then created between the variables Hospital and Birth.type, and a Chi-square test is performed, which is appropriate for testing the hypothesis of independence between qualitative variables.

# Contingency table between Hospital and Birth.type
tab_births <- table(df$Hospital, df$Birth.type)
kbl(tab_births) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Ces Nat
osp1 242 574
osp2 254 595
osp3 232 603
# Chi-squared test
chi2_test <- chisq.test(tab_births)
print(chi2_test)
## 
##  Pearson's Chi-squared test
## 
## data:  tab_births
## X-squared = 1.0972, df = 2, p-value = 0.5778

From the chi-squared test we obtain a p-value of \(0.577\). Therefore, the hypothesis of independence between the variables Hospital and Birth.type cannot be rejected, i.e., there is no significant difference in the birth type between the three hospitals. Thus, no hospital performs significantly more cesarean sections.

For the second point, the sample means of the variables Weight and Length are compared with the reference means of the population (Weight = 3300 g, Length = 500 mm, source: Ospedale Bambino Gesu) to verify that they are equal. In this case, it is appropriate to use a one-sample t-test if the data are normal or a Wilcoxon test if they are not.

It has already been checked, via the Shapiro–Wilk test, that the distribution of the Weight variable is not normal, so a Wilcoxon test is carried out.

# Wilcoxon test for the variable Weight
wilcox.test(Weight, mu = 3300)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  Weight
## V = 1495594, p-value = 0.9612
## alternative hypothesis: true location is not equal to 3300

The Wilcoxon test yields a p-value of \(0.961\). Therefore the null hypothesis cannot be rejected: the mean weight of the newborns in the sample can be considered not significantly different from that of the population.

A Shapiro–Wilk test is now performed to check the normality of the Length variable.

shapiro.test(Length)
## 
##  Shapiro-Wilk normality test
## 
## data:  Length
## W = 0.90941, p-value < 2.2e-16

Here too the Length variable does not follow a normal distribution, so a Wilcoxon test is carried out with null hypothesis of equality to the population mean (500 mm).

# Wilcoxon test for the variable Length
wilcox.test(Length, mu = 500)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  Length
## V = 877236, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 500

With a p-value \(< 2.2\times10^{-16}\) the null hypothesis that the mean length of this sample of newborns is equal to that of the population is rejected. Therefore, it can be stated that the mean length calculated for this sample, equal to 494.7, is significantly lower than the population mean.

For the third point, the distributions of Weight, Length, and Cranium between male and female newborns are compared to verify that they are significantly different between the two sexes. In this case, a Two-sample independent t-test (defined based on the categorical variable Sex) is used if the data are normal, or a Wilcoxon test if they are not. It had already been verified, using the Shapiro-Wilk test, that the data for the variables Weight and Length do not have a normal distribution, so the Wilcoxon test is performed.

# Wilcoxon test for the variable Weight by Sex
wilcox.test(Weight ~ Sex)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Weight by Sex
## W = 538641, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

With a p-value \(< 2.2\times10^{-16}\) the null hypothesis that the mean weight is the same in the two sexes is rejected: there is a significant difference in weight between males and females.

# Wilcoxon test for the variable Length by Sex
wilcox.test(Length ~ Sex)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Length by Sex
## W = 594455, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Here again, with a p-value \(< 2.2\times10^{-16}\) the null hypothesis that the mean length is the same in the two sexes is rejected: the mean length differs significantly between males and females.

A Shapiro–Wilk test is now performed to check the normality of the Cranium variable.

shapiro.test(Cranium)
## 
##  Shapiro-Wilk normality test
## 
## data:  Cranium
## W = 0.96357, p-value < 2.2e-16

Since the Cranium variable also does not have a normal distribution, a Wilcoxon test is carried out.

# Wilcoxon test for the variable Cranium by Sex
wilcox.test(Cranium ~ Sex)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Cranium by Sex
## W = 641638, p-value = 9.633e-15
## alternative hypothesis: true location shift is not equal to 0

With a p-value of \(9.633 \times 10^{-15}\) the null hypothesis that the mean cranium diameter is the same between the two sexes is rejected: the measurement differs significantly between males and females.

3. Building the Regression Model

In this phase, a multiple linear regression model is developed in which Weight is the response variable and all variables in the dataset are included as explanatory variables. In this way, it is possible to quantify the impact of each independent variable on newborn weight.

# Multiple linear regression model with all predictors
mod1 <- lm(Weight ~ ., data = df)
# Model summary
summary(mod1)
## 
## Call:
## lm(formula = Weight ~ ., data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1124.40  -181.66   -14.42   160.91  2611.89 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6738.4762   141.3087 -47.686  < 2e-16 ***
## Mother.age        0.8921     1.1323   0.788   0.4308    
## Pregnancies.n    11.2665     4.6608   2.417   0.0157 *  
## Smoker          -30.1631    27.5386  -1.095   0.2735    
## Gestation.w      32.5696     3.8187   8.529  < 2e-16 ***
## Length           10.2945     0.3007  34.236  < 2e-16 ***
## Cranium          10.4707     0.4260  24.578  < 2e-16 ***
## Birth.typeNat    29.5254    12.0844   2.443   0.0146 *  
## Hospitalosp2    -11.2095    13.4379  -0.834   0.4043    
## Hospitalosp3     28.0958    13.4957   2.082   0.0375 *  
## SexM             77.5409    11.1776   6.937 5.08e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.9 on 2489 degrees of freedom
## Multiple R-squared:  0.7289, Adjusted R-squared:  0.7278 
## F-statistic: 669.2 on 10 and 2489 DF,  p-value: < 2.2e-16

The model summary shows the regression coefficient estimates for all variables, which represent the marginal effects of the individual explanatory variables on the response variable Weight. However, although all variables (with the exception of the Mother.age variable ) have an effect on the Weight variable, not all variables are significant, as indicated by the high p-values.

This suggests the possibility of developing a better model by excluding some of the explanatory variables included in this initial model. In particular, it is likely that eliminating the variables Mother.age, Smoker, and Hospital would result in a better and more parsimonious model.

The Variance Inflation Factor (VIF) is also computed to assess possible multicollinearity between the independent variables included in the model.

vif(mod1)
##                   GVIF Df GVIF^(1/(2*Df))
## Mother.age    1.187454  1        1.089704
## Pregnancies.n 1.186428  1        1.089233
## Smoker        1.007392  1        1.003689
## Gestation.w   1.695810  1        1.302233
## Length        2.085755  1        1.444214
## Cranium       1.630796  1        1.277026
## Birth.type    1.004242  1        1.002119
## Hospital      1.004071  2        1.001016
## Sex           1.040643  1        1.020119

Since the VIF values are below 5 for all variables, there is no evidence of problematic multicollinearity, as also suggested above by observing the correlation matrix. Therefore, it is not necessary to remove variables for this reason.

4. Selection of the Optimal Model

In this phase, the most parsimonious model is selected, i.e. the model that balances goodness of fit and complexity, thus eliminating non-significant variables. This procedure is performed using the stepAIC function of the “MASS” library, which applies a stepwise selection to minimize the model’s information criterion. In this case, the BIC (Bayesian Information Criterion) is used as criterion, which tends to be more parsimonious in the selection, retaining only variables that are strictly relevant.

# Selection of the optimal model with stepwise procedure based on BIC
n = nrow(df)
stepwise.mod = stepAIC(mod1, direction = "both", k = log(n))
## Start:  AIC=28139.32
## Weight ~ Mother.age + Pregnancies.n + Smoker + Gestation.w + 
##     Length + Cranium + Birth.type + Hospital + Sex
## 
##                 Df Sum of Sq       RSS   AIC
## - Mother.age     1     46578 186809099 28132
## - Smoker         1     90019 186852540 28133
## - Hospital       2    685979 187448501 28133
## - Pregnancies.n  1    438452 187200974 28137
## - Birth.type     1    447929 187210450 28138
## <none>                       186762521 28139
## - Sex            1   3611021 190373542 28179
## - Gestation.w    1   5458403 192220925 28204
## - Cranium        1  45326172 232088693 28675
## - Length         1  87951062 274713583 29096
## 
## Step:  AIC=28132.12
## Weight ~ Pregnancies.n + Smoker + Gestation.w + Length + Cranium + 
##     Birth.type + Hospital + Sex
## 
##                 Df Sum of Sq       RSS   AIC
## - Smoker         1     90897 186899996 28126
## - Hospital       2    692738 187501837 28126
## - Birth.type     1    448222 187257321 28130
## <none>                       186809099 28132
## - Pregnancies.n  1    633756 187442855 28133
## + Mother.age     1     46578 186762521 28139
## - Sex            1   3618736 190427835 28172
## - Gestation.w    1   5412879 192221978 28196
## - Cranium        1  45588236 232397335 28670
## - Length         1  87950050 274759149 29089
## 
## Step:  AIC=28125.51
## Weight ~ Pregnancies.n + Gestation.w + Length + Cranium + Birth.type + 
##     Hospital + Sex
## 
##                 Df Sum of Sq       RSS   AIC
## - Hospital       2    701680 187601677 28119
## - Birth.type     1    440684 187340680 28124
## <none>                       186899996 28126
## - Pregnancies.n  1    610840 187510837 28126
## + Smoker         1     90897 186809099 28132
## + Mother.age     1     47456 186852540 28133
## - Sex            1   3602797 190502794 28165
## - Gestation.w    1   5346781 192246777 28188
## - Cranium        1  45632149 232532146 28664
## - Length         1  88355030 275255027 29086
## 
## Step:  AIC=28119.23
## Weight ~ Pregnancies.n + Gestation.w + Length + Cranium + Birth.type + 
##     Sex
## 
##                 Df Sum of Sq       RSS   AIC
## - Birth.type     1    463870 188065546 28118
## <none>                       187601677 28119
## - Pregnancies.n  1    651066 188252743 28120
## + Hospital       2    701680 186899996 28126
## + Smoker         1     99840 187501837 28126
## + Mother.age     1     54392 187547285 28126
## - Sex            1   3649259 191250936 28160
## - Gestation.w    1   5444109 193045786 28183
## - Cranium        1  45758101 233359778 28657
## - Length         1  88054432 275656108 29074
## 
## Step:  AIC=28117.58
## Weight ~ Pregnancies.n + Gestation.w + Length + Cranium + Sex
## 
##                 Df Sum of Sq       RSS   AIC
## <none>                       188065546 28118
## - Pregnancies.n  1    623141 188688687 28118
## + Birth.type     1    463870 187601677 28119
## + Hospital       2    724866 187340680 28124
## + Smoker         1     91892 187973654 28124
## + Mother.age     1     54816 188010731 28125
## - Sex            1   3655292 191720838 28158
## - Gestation.w    1   5464853 193530399 28181
## - Cranium        1  46108583 234174130 28658
## - Length         1  87632762 275698308 29066
summary(stepwise.mod)
## 
## Call:
## lm(formula = Weight ~ Pregnancies.n + Gestation.w + Length + 
##     Cranium + Sex, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1149.44  -180.81   -15.58   163.64  2639.72 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6681.1445   135.7229 -49.226  < 2e-16 ***
## Pregnancies.n    12.4750     4.3396   2.875  0.00408 ** 
## Gestation.w      32.3321     3.7980   8.513  < 2e-16 ***
## Length           10.2486     0.3006  34.090  < 2e-16 ***
## Cranium          10.5402     0.4262  24.728  < 2e-16 ***
## SexM             77.9927    11.2021   6.962 4.26e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.6 on 2494 degrees of freedom
## Multiple R-squared:  0.727,  Adjusted R-squared:  0.7265 
## F-statistic:  1328 on 5 and 2494 DF,  p-value: < 2.2e-16

After the selection procedure, the best model according to BIC includes the following variables: Pregnancies.n, Gestation.w, Length, Cranium, Sex.

# BIC comparison
BIC(mod1, stepwise.mod)
##              df      BIC
## mod1         12 35241.84
## stepwise.mod  7 35220.10

Compared to the initial model, the selected model has a lower BIC, and is therefore preferable from the point of view of both simplicity and interpretation.

Models with interactions between some of the predictors are now tested. In particular, a model including the interaction between Pregnancies.n and Gestation.w is first considered:

# Model with interaction between Pregnancies.n and Gestation.w
mod_int_1 = lm(
  formula = Weight ~ Pregnancies.n * Gestation.w + Length + Cranium + Sex,
  data = df
)
summary(mod_int_1)
## 
## Call:
## lm(formula = Weight ~ Pregnancies.n * Gestation.w + Length + 
##     Cranium + Sex, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1148.72  -180.77   -14.14   163.93  2638.53 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -6588.5062   158.7705 -41.497  < 2e-16 ***
## Pregnancies.n               -66.1971    70.1095  -0.944    0.345    
## Gestation.w                  29.9109     4.3658   6.851 9.20e-12 ***
## Length                       10.2482     0.3006  34.091  < 2e-16 ***
## Cranium                      10.5463     0.4263  24.741  < 2e-16 ***
## SexM                         77.9137    11.2017   6.956 4.47e-12 ***
## Pregnancies.n:Gestation.w     2.0278     1.8036   1.124    0.261    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.6 on 2493 degrees of freedom
## Multiple R-squared:  0.7271, Adjusted R-squared:  0.7265 
## F-statistic:  1107 on 6 and 2493 DF,  p-value: < 2.2e-16

In this case, the interaction term (Pregnancies.n:Gestation.w) is not significant (p-value = 0.261).

BIC(stepwise.mod, mod_int_1)
##              df      BIC
## stepwise.mod  7 35220.10
## mod_int_1     8 35226.66

The BIC of this model is also higher than that of the stepwise model, so the interaction between Pregnancies.n and Gestation.w does not improve the model.

A model with interaction between Length and Cranium is then tested:

# Model with interaction between Sex and Gestation.w
mod_int_2 = lm(
  formula = Weight ~ Pregnancies.n + Gestation.w + Length * Cranium + Sex,
  data = df
)
summary(mod_int_2)
## 
## Call:
## lm(formula = Weight ~ Pregnancies.n + Gestation.w + Length * 
##     Cranium + Sex, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1150.74  -180.48   -13.62   165.82  2866.17 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.801e+03  1.018e+03  -1.770  0.07685 .  
## Pregnancies.n   1.295e+01  4.321e+00   2.996  0.00276 ** 
## Gestation.w     3.810e+01  3.964e+00   9.610  < 2e-16 ***
## Length         -3.047e-01  2.202e+00  -0.138  0.88996    
## Cranium        -4.759e+00  3.191e+00  -1.491  0.13601    
## SexM            7.327e+01  1.119e+01   6.545 7.20e-11 ***
## Length:Cranium  3.158e-02  6.528e-03   4.838 1.39e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.4 on 2493 degrees of freedom
## Multiple R-squared:  0.7295, Adjusted R-squared:  0.7289 
## F-statistic:  1121 on 6 and 2493 DF,  p-value: < 2.2e-16

The interaction term (Length:Cranium) is here significant (p-value = \(1.39 \times 10^{-6}\))

BIC(stepwise.mod, mod_int_2)
##              df      BIC
## stepwise.mod  7 35220.10
## mod_int_2     8 35204.57

In this case the BIC is also lower than that of the stepwise model. An ANOVA test is also performed, which takes into account the variance explained by the two models:

anova(mod_int_2, stepwise.mod)
## Analysis of Variance Table
## 
## Model 1: Weight ~ Pregnancies.n + Gestation.w + Length * Cranium + Sex
## Model 2: Weight ~ Pregnancies.n + Gestation.w + Length + Cranium + Sex
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
## 1   2493 186316621                                  
## 2   2494 188065546 -1  -1748926 23.401 1.395e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result of the ANOVA test confirms that there is a significant increase in explained variance when the interaction is added, so mod_int_2 can be considered a better model.

Possible non-linear effects are now tested. In particular, by observing the prevously created scatter plots, the quadratic effect of the Length variable is first added to the model:

# Model with Length^2
mod_quad_1 = update(mod_int_2, ~ . + I(Length^2))
summary(mod_quad_1)
## 
## Call:
## lm(formula = Weight ~ Pregnancies.n + Gestation.w + Length + 
##     Cranium + Sex + I(Length^2) + Length:Cranium, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1176.69  -179.20   -11.78   165.68  1306.79 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.262e+03  1.003e+03  -2.256 0.024135 *  
## Pregnancies.n   1.425e+01  4.254e+00   3.350 0.000821 ***
## Gestation.w     4.042e+01  3.909e+00  10.339  < 2e-16 ***
## Length         -2.137e+01  3.169e+00  -6.744 1.91e-11 ***
## Cranium         2.741e+01  4.725e+00   5.800 7.46e-09 ***
## SexM            7.186e+01  1.102e+01   6.523 8.29e-11 ***
## I(Length^2)     4.476e-02  4.914e-03   9.109  < 2e-16 ***
## Length:Cranium -3.449e-02  9.689e-03  -3.560 0.000378 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 269 on 2492 degrees of freedom
## Multiple R-squared:  0.7383, Adjusted R-squared:  0.7375 
## F-statistic:  1004 on 7 and 2492 DF,  p-value: < 2.2e-16
BIC(mod_int_2, mod_quad_1)
##            df      BIC
## mod_int_2   8 35204.57
## mod_quad_1  9 35130.51

The quadratic term of the Length variable resulted to be very significant (p-value \(< 2\times10^{-16}\) ). In addition, there is a further decrease in BIC and also a one point increase in the adjusted R-squared coefficient.

The quadratic effect of the Gestation.w variable is now also added:

# Model with Gestation.w^2
mod_quad_2 = update(mod_quad_1, ~ . + I(Gestation.w^2))
summary(mod_quad_2)
## 
## Call:
## lm(formula = Weight ~ Pregnancies.n + Gestation.w + Length + 
##     Cranium + Sex + I(Length^2) + I(Gestation.w^2) + Length:Cranium, 
##     data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1191.64  -181.49   -12.11   165.26  1325.43 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -3.400e+03  1.048e+03  -3.244 0.001193 ** 
## Pregnancies.n     1.451e+01  4.245e+00   3.418 0.000640 ***
## Gestation.w       2.862e+02  6.769e+01   4.228 2.44e-05 ***
## Length           -3.082e+01  4.091e+00  -7.532 6.93e-14 ***
## Cranium           2.042e+01  5.090e+00   4.012 6.19e-05 ***
## SexM              7.328e+01  1.100e+01   6.664 3.28e-11 ***
## I(Length^2)       4.947e-02  5.070e-03   9.757  < 2e-16 ***
## I(Gestation.w^2) -3.227e+00  8.872e-01  -3.637 0.000281 ***
## Length:Cranium   -2.046e-02  1.041e-02  -1.966 0.049358 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 268.3 on 2491 degrees of freedom
## Multiple R-squared:  0.7396, Adjusted R-squared:  0.7388 
## F-statistic: 884.6 on 8 and 2491 DF,  p-value: < 2.2e-16
BIC(mod_quad_1, mod_quad_2)
##            df      BIC
## mod_quad_1  9 35130.51
## mod_quad_2 10 35125.09

The quadratic term of the Gestation.w variable also results to be highly significant (p-value = \(0.0002\)). Moreover, there is a further decrease in BIC and a slight increase in the adjusted R-squared. An ANOVA test is performed to compare the models:

anova(mod_quad_2, mod_int_2)
## Analysis of Variance Table
## 
## Model 1: Weight ~ Pregnancies.n + Gestation.w + Length + Cranium + Sex + 
##     I(Length^2) + I(Gestation.w^2) + Length:Cranium
## Model 2: Weight ~ Pregnancies.n + Gestation.w + Length * Cranium + Sex
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
## 1   2491 179360498                                  
## 2   2493 186316621 -2  -6956123 48.304 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result of the ANOVA test confirms that there is a significant gain in explained variance when the quadratic terms of Length and Gestation.w variables are included; mod_quad_2 is therefore considered the best model among those fitted.

5. Analysis of Model Quality

In this phase, the predictive ability of the final model is first compared with that of the previous models, using three metrics:

RMSE is a metric that represents the average distance between the observed values and the values predicted by the model. A lower RMSE indicates a better ability of the model to predict data.

# List of models
models <- list(
  "Base" = mod1,
  "Stepwise" = stepwise.mod,
  "Interaction" = mod_int_2,
  "Quadratic" = mod_quad_2
)
# Computation of the metrics for each model
metrics <- data.frame(
  BIC = sapply(models, BIC),
  Adjusted_R2 = sapply(models, function(m) summary(m)$adj.r.squared),
  RMSE = sapply(models, function(m) rmse(df$Weight, predict(m)))
)
kbl(metrics) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
BIC Adjusted_R2 RMSE
Base 35241.84 0.7278038 273.3222
Stepwise 35220.10 0.7264542 274.2740
Interaction 35204.57 0.7288893 272.9957
Quadratic 35125.09 0.7388017 267.8511

From the summary table, all the metrics (lower BIC, higher adjusted R-squared, lower RMSE) are better for the final model, which includes the quadratic terms, confirming that it is the most appropriate among those considered.

A residual analysis of the model is now carried out.

par(mfrow = c(2, 2))
plot(mod_quad_2, pch = 20)

From the diagnostic plots, residuals are largely normally distributed, but there are some values, particularly in the upper tail, showing some deviations from normality. Furthermore, the residuals do not appear to be evenly distributed around the mean of 0, i.e. they show a slight heteroscedasticity.

Lastly, looking at the last of the four standard diagnostic plots, which shows potential influential observations on regression estimates (Cook’s distance), it can be seen that one of the observations exceeds both problematic thresholds of 0.5 and 1.

Residual tests are now performed.

# Normality test (Shapiro–Wilk)
shapiro.test(residuals(mod_quad_2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(mod_quad_2)
## W = 0.99039, p-value = 6.954e-12
ggplot() +
  geom_density(aes(x = residuals(mod_quad_2)), col = "black", fill = "steelblue") +
  labs(title = "Density of residuals",
       x = "Residuals", y = "Density")

The Shapiro-Wilk test confirms that the residuals do not appear to be perfectly normal. As can be seen from the density graph, this is mainly due to some extreme values in the two tails.

# Homoscedasticity test (Breusch-Pagan)
bptest(mod_quad_2)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod_quad_2
## BP = 134.53, df = 8, p-value < 2.2e-16

The Breusch-Pagan test for homoscedasticity also indicates that the variance of the residuals is not constant, as previously suggested by the diagnostic plots.

# Test of lack of autocorrelation (Durbin–Watson)
dwtest(mod_quad_2)
## 
##  Durbin-Watson test
## 
## data:  mod_quad_2
## DW = 1.9471, p-value = 0.09295
## alternative hypothesis: true autocorrelation is greater than 0

From the Durbin–Watson test, the null hypothesis of no autocorrelation is not rejected, so the residuals do not show evidence of autocorrelation.

To identify precisely any influential values, leverage values (hat values) are examined, which correspond to observations located far out in the space of the explanatory variables.

# Leverage values
hat = hatvalues(mod_quad_2)
p = sum(hat)
threshold = 2 * p/n
plot(hat, pch = 20, main = "Leverage", ylab = "Hat values")
abline(h = threshold, col = 2)

From the plot there are several leverage values that exceed the threshold (2·p/n), one of which by a considerable margin.

Extreme values of the response variable (outliers) are also identified through the analysis of studentized residuals.

# Studentized residuals
plot(rstudent(mod_quad_2), pch = 20,
     main = "Studentized residuals", ylab = "Studentized residuals")

# Outlier test based on studentized residuals
outlierTest(mod_quad_2)
##       rstudent unadjusted p-value Bonferroni p
## 1551  6.249419         4.8311e-10   1.2078e-06
## 1306  4.968338         7.2110e-07   1.8028e-03
## 1399 -4.463879         8.4073e-06   2.1018e-02
## 155   4.387990         1.1917e-05   2.9792e-02
## 1694  4.290034         1.8546e-05   4.6366e-02

By performing a test on studentized residuals, it appears that in this case 5 values are classified as outliers, and therefore potentially influential values.

The combined effect of leverage and outliers is now considered by computing and visualising Cook’s distances:

cook = cooks.distance(mod_quad_2)
plot(cook, pch = 20, main = "Cook's distance")
abline(h = 0.5, col = 2)

# Values above the Cook's distance threshold of 0.5
cook[cook > 0.5]
##     1551 
## 5.311502

As already noted, only one value exceeds the Cook’s distance threshold of 0.5 and can therefore be considered influential. However, assuming that the observation is not a measurement error, the model can be considered sufficiently adequate.

6. Predictions and Results

In this phase, the model is used to make predictions, estimating the weight of a newborn by considering different combinations of values for the explanatory variables. To simplify the process, a function is created that takes the values of the explanatory variables of the model as arguments (with the median values of the dataset set as default values) and returns the prediction of the Weight variable.

# Function to predict newborn weight
weight_prediction = function(Pregnancies.n = 1, 
                             Gestation.w = 39, 
                             Length = 500, 
                             Cranium = 340, 
                             Sex = "M") {
  
  new_data <- data.frame(Pregnancies.n = c(Pregnancies.n), 
                         Gestation.w = c(Gestation.w), 
                         Length = c(Length),
                         Cranium = c(Cranium),
                         Sex = as.factor(c(Sex)))
  
  predicted_weight = predict(mod_quad_2, newdata = new_data)
  
  return(cat("Based on the following values:\n",
             "\nNumber of pregnancies:", Pregnancies.n,
             "\nWeeks of gestation:", Gestation.w,
             "\nLength of the newborn (mm):", Length,
             "\nCranium diameter of the newborn (mm):", Cranium, 
             "\nSex of the newborn:", Sex, 
             "\n\nThe estimated weight of the newborn (g) is:", predicted_weight))
}

Without specifying any arguments, the function returns the prediction with the default values:

weight_prediction()
## Based on the following values:
##  
## Number of pregnancies: 1 
## Weeks of gestation: 39 
## Length of the newborn (mm): 500 
## Cranium diameter of the newborn (mm): 340 
## Sex of the newborn: M 
## 
## The estimated weight of the newborn (g) is: 3364.558

Setting the Sex variable to "F", the function returns the prediction for a female newborn:

weight_prediction(Sex = "F")
## Based on the following values:
##  
## Number of pregnancies: 1 
## Weeks of gestation: 39 
## Length of the newborn (mm): 500 
## Cranium diameter of the newborn (mm): 340 
## Sex of the newborn: F 
## 
## The estimated weight of the newborn (g) is: 3291.283

It is possible to insert different combinations of arguments (e.g. Pregnancies.n, Gestation.w, Sex) to modify the values of the variables and return the corresponding prediction:

weight_prediction(Pregnancies.n = 3)
## Based on the following values:
##  
## Number of pregnancies: 3 
## Weeks of gestation: 39 
## Length of the newborn (mm): 500 
## Cranium diameter of the newborn (mm): 340 
## Sex of the newborn: M 
## 
## The estimated weight of the newborn (g) is: 3393.577
weight_prediction(Pregnancies.n = 3, Sex = "F")
## Based on the following values:
##  
## Number of pregnancies: 3 
## Weeks of gestation: 39 
## Length of the newborn (mm): 500 
## Cranium diameter of the newborn (mm): 340 
## Sex of the newborn: F 
## 
## The estimated weight of the newborn (g) is: 3320.302
weight_prediction(Gestation.w = 28)
## Based on the following values:
##  
## Number of pregnancies: 1 
## Weeks of gestation: 28 
## Length of the newborn (mm): 500 
## Cranium diameter of the newborn (mm): 340 
## Sex of the newborn: M 
## 
## The estimated weight of the newborn (g) is: 2594.537
weight_prediction(Gestation.w = 28, Sex = "F")
## Based on the following values:
##  
## Number of pregnancies: 1 
## Weeks of gestation: 28 
## Length of the newborn (mm): 500 
## Cranium diameter of the newborn (mm): 340 
## Sex of the newborn: F 
## 
## The estimated weight of the newborn (g) is: 2521.262

By entering the values of all the variables included in the model, the predicted weight is estimated with the highest precision allowed by the model:

weight_prediction(Pregnancies.n = 2, Gestation.w = 33, Length = 512, Cranium = 345, Sex = "F")
## Based on the following values:
##  
## Number of pregnancies: 2 
## Weeks of gestation: 33 
## Length of the newborn (mm): 512 
## Cranium diameter of the newborn (mm): 345 
## Sex of the newborn: F 
## 
## The estimated weight of the newborn (g) is: 3179.727

7. Visualizations

Finally, several plots are created to visualise some of the most important relationships between the variables included in the model.

For example, the relationship between the response variable Weight and the explanatory variable Gestation.w is shown by a scatter plot with colours representing Sex (also showing the corresponding regression lines) and different point sizes for Length:

ggplot(data = df) + 
  geom_point(aes(x = Gestation.w, y = Weight, colour = Sex, size = Length),
             alpha = 0.2,
             position = "jitter") +
  geom_smooth(aes(x = Gestation.w, y = Weight, colour = Sex),
              se = FALSE, method = "lm") +
  labs(title = "Impact of Gestation, Sex and Length on Weight")

Then the relationship between Weight and Length is visualized, using point size to represent the Gestation.w variable:

ggplot(data = df) + 
  geom_point(aes(x = Length, y = Weight, colour = Sex, size = Gestation.w),
             alpha = 0.2,
             position = "jitter") +
  geom_smooth(aes(x = Length, y = Weight, colour = Sex),
              se = FALSE, method = "lm") +
  labs(title = "Impact of Length, Sex and Gestation on Weight")

Using the plotly library, a three-dimensional scatter plot is created to display the joint effect of Gestation.w and Length on the response variable Weight, keeping two different colours for Sex:

fig <- plot_ly(data = df, 
               x = ~Gestation.w, y = ~Length, z = ~Weight, 
               color = ~Sex,
               opacity = 0.5,
               marker = list(size = 5)) %>%   
  layout(title = "Impact of Gestation, Length and Sex on Weight",
         legend = list(title = list(text = "Sex")))

fig

Morever, in this way the point size can also be used to include a fifth variable, for example Pregnancies.n:

fig <- plot_ly(data = df, 
               x = ~Gestation.w, y = ~Length, z = ~Weight, 
               color = ~Sex, size = ~Pregnancies.n, sizes = c(50, 500),
               text = ~paste('Number of pregnancies:', Pregnancies.n)) %>%   
  layout(title = "Impact of Gestation, Length, Sex and Pregnancies num. on Weight",
         legend = list(title = list(text = "Sex")))

fig

8. Conclusions

This inferential analysis on clinical data from 2500 newborns allowed to identify the main factors that influence birth weight and to build a regression model capable of predicting newborn weight from measurable clinical variables. Such a model may be useful to rapidly detect possible anomalies and plan in advance the use of intensive care.

The analysis has also allowed several preliminary questions to be answered. Specifically, it was found that caesarean sections do not appear to be significantly more frequent in one hospital than in the others. Moreover, the mean of weight in the three hospitals considered is not significantly different from that of the reference population, whereas length is significantly lower. Lastly, the hypothesis that the anthropometric measurements (weight, length and cranium diameter) are significantly different between the two sexes has been confirmed.

Some interesting conclusions emerged from the multiple regression analysis. For example, it was found that factors such as maternal smoking and mother’s age do not significantly affect birth weight. Instead, duration of gestation and length have a strong effect on weight, and their relationship with weight is not linear, as shown by the improvement obtained by including quadratic terms. However, further work including other samples obtained from a higher number of hospitals or additional variables could lead to different results and more generalizable conclusions.